Abstract:

In this paper, we treat DNA sequencing as an optimization problem and propose a
combinatorial approach to recovering the original DNA sequence. To do so, we consider a path
in a weighted graph that maximizes the travelling cost in a TSP with differing intercity
costs. This path is a Hamiltonian path, and it gives the optimal solution to the DNA
sequencing problem.

Deoxyribonucleic acid (DNA) contains the genetic code that is passed from generation
to generation. It consists of two strands connected by hydrogen bonds, each of which is a chain
of nucleotides from the alphabet $Z = \{A, C, G, T\}$, where A, C, G and T stand for Adenine,
Cytosine, Guanine and Thymine respectively. Each nucleotide in one strand is bonded to its
complementary nucleotide in the other strand: A pairs with T and C pairs with G. There are
three main areas in the field of DNA: DNA sequencing, DNA assembling and DNA mapping. The
discovery of the structure of DNA by Watson and Crick [14] is said to have reshaped modern
biology. DNA sequencing technologies have been available since the 1970s and are still
evolving. Sanger and Gilbert received the 1980 Nobel Prize for their DNA sequencing methods.

The DNA sequencing problem is to determine a sequence (string) of nucleotides
(symbols) drawn from the set $Z = \{A, C, G, T\}$ [3, 4, 5]. The input data can be viewed as a
set (called the spectrum) of words (called fragments or oligonucleotides) that comes from a
biochemical hybridization experiment. These fragments usually overlap and may vary in
length. A spectrum has positive errors if it contains fragments that are absent from the
original sequence, and negative errors if fragments that occur in the original sequence are
missing from the spectrum. Repetitions of fragments in the sequence are also treated as
negative errors.
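
The two error types can be illustrated with a short Python sketch. It assumes equal-length fragments and models a spectrum as a set, so repetition-type negative errors are not captured; the function names and the sample data are hypothetical and serve only as an illustration.

```python
def ideal_spectrum(sequence, l):
    """Return the set of all l-mers occurring in `sequence`."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


def spectrum_errors(observed, sequence, l):
    """Compare an observed spectrum with the ideal spectrum of `sequence`.

    positive errors: fragments observed but absent from the sequence
    negative errors: fragments of the sequence missing from the observation
    """
    ideal = ideal_spectrum(sequence, l)
    positive = observed - ideal
    negative = ideal - observed
    return positive, negative


# Hypothetical data: one spurious fragment (AAA) and one missing fragment (GCA).
observed = {"ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "AAA"}
positive, negative = spectrum_errors(observed, "ATGGCGTGCA", 3)
print("positive errors:", positive)
print("negative errors:", negative)
```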

Errors occurring during the hybridization experiment play an important role in reconstructing
the original sequence. The computational complexity of the main variants of the problem is
already known: the variant with no errors (ideal spectrum) is solvable in polynomial time [7],
whereas the variants with errors in the spectrum are NP-hard [3]. The two most popular methods
for DNA sequencing are the Sanger method and the Sequencing by Hybridization (SBH)
method. The aim is to reconstruct the original DNA sequence of a known length $n$ on the basis
of the overlapping words.



The second section of this paper contains a short review of SBH along with two methods
mentioned in [2], both based on approaches from graph theory. The third section gives a short
review of the DNA sequencing problem as the shortest common superstring (SCS) problem and its
representation as an optimization problem. In the fourth section, we propose a new
approach to solving the SCS problem for a spectrum with fragments of varying length, which may
be converted to a Travelling Salesman Problem (TSP) with maximum cost. Finally, the fifth
section includes conclusions.



2 Sequencing by Hybridization (SBH) 

Sequencing by hybridization (SBH) is one of the most popular methods in
computational molecular biology. SBH assumes that the spectrum is ideal and that the
fragments are all $l$-mers (that is, of equal length $l$) composing the original sequence.
Different sequences may have the same spectrum: that is, we may have $s_1 \neq s_2$ such that

$$\mathrm{spectrum}(s_1, l) = \mathrm{spectrum}(s_2, l).$$

If the spectrum $S$ contains no errors, each fragment overlaps the next one in exactly $l - 1$
positions, and the total length $n$ of the final sequence can be calculated as $n = |S| + l - 1$.
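
Both statements can be checked with a few lines of Python. This is a minimal sketch: the helper name `spectrum` and the two sample strings are chosen for illustration, and the length formula is assumed to apply only when no $l$-mer repeats in the sequence.

```python
def spectrum(sequence, l):
    """Return the set of all l-mers occurring in `sequence`."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


l = 3
s1 = "ATGCGTGGCA"
s2 = "ATGGCGTGCA"

# Two different sequences that share one and the same spectrum.
same_spectrum = s1 != s2 and spectrum(s1, l) == spectrum(s2, l)

# Length recovered from the spectrum alone: n = |S| + l - 1.
n = len(spectrum(s1, l)) + l - 1
print(same_spectrum, n)
```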

2.1 Methods 

Several methods for the DNA sequencing problem with a constant-length oligonucleotide
library are described in [2], each of them based on approaches from graph theory.

Method 2.1.1 This method refers to a well-known problem from graph theory mentioned in [13].
Here a Hamiltonian path is searched for in a directed graph. We construct a directed
graph in which each vertex corresponds to one element of the spectrum. Two vertices
$u$ and $v$ are connected by the arc $(u, v)$ if the last $l - 1$ letters of the label (fragment)
of $u$ coincide with the first $l - 1$ letters of the label of $v$. The graph may contain more
than one Hamiltonian path, each yielding a sequence of the same length (Example 1).
Example 1 Suppose a hybridization experiment generates the ideal spectrum S = {ATG, TGG,
TGC, GTG, GGC, GCA, GCG, CGT}. The graph from method 2.1.1 is presented in Fig. 1.

Fig. 1. A graph from method 2.1.1.
In this graph we have two Hamiltonian paths:

Path 1: (ATG-TGG-GGC-GCG-CGT-GTG-TGC-GCA), which results in ATGGCGTGCA, and

Path 2: (ATG-TGC-GCG-CGT-GTG-TGG-GGC-GCA), which results in ATGCGTGGCA.

The length of the final sequence can be obtained as

$$n = |S| + l - 1 = 8 + 3 - 1 = 10$$
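
The search in method 2.1.1 can be sketched as a depth-first enumeration of Hamiltonian paths in the overlap graph. This is an illustrative brute-force sketch, not the procedure of [2]; the function names are hypothetical, and $l \geq 2$ is assumed.

```python
def hamiltonian_paths(spectrum, l):
    """Enumerate all Hamiltonian paths of the overlap graph by depth-first search."""
    fragments = list(spectrum)
    # Arc u -> v when the last l-1 letters of u equal the first l-1 letters of v.
    successors = {
        u: [v for v in fragments if v != u and u[-(l - 1):] == v[:l - 1]]
        for u in fragments
    }
    paths = []

    def extend(path, visited):
        if len(path) == len(fragments):
            paths.append(list(path))
            return
        for v in successors[path[-1]]:
            if v not in visited:
                visited.add(v)
                path.append(v)
                extend(path, visited)
                path.pop()
                visited.remove(v)

    for start in fragments:
        extend([start], {start})
    return paths


def spell(path):
    """Glue a path of l-mers into a sequence: first fragment plus each last letter."""
    return path[0] + "".join(fragment[-1] for fragment in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
paths = hamiltonian_paths(S, 3)
sequences = sorted(spell(p) for p in paths)
print(sequences)
```

The running time of such an enumeration grows exponentially with the size of the spectrum, which is why it is practical only for small examples.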

This method accepts an ideal spectrum as input data and leads to an exponential-time
algorithm. The first polynomial-time algorithm for an ideal spectrum with constant-length
fragments was presented in [10]. In that method (method 2.1.2 below), an Eulerian path is
searched for in a directed graph based on the spectrum.



Method 2.1.2 Suppose the ideal spectrum $S$ consists of $l$-mer fragments. We build a directed
graph for $S$ as follows. The vertices correspond to all $(l-1)$-mers, i.e. $(l-1)$-tuples. For
each $l$-mer in the spectrum, we add an edge from the vertex representing its first $l - 1$
characters to the vertex representing its last $l - 1$ characters, so that the edges correspond
to the fragments in $S$. A fragment $a\omega b$ (where $a$, $b$ are letters) forms an edge from
the prefix node $a\omega$ to the suffix node $\omega b$. This graph may also contain more than
one Eulerian path, each yielding a sequence of the same length (Example 2).

Example 2 Suppose a hybridization experiment generates the ideal spectrum S = {ATG, TGG,
TGC, GTG, GGC, GCA, GCG, CGT} as in Example 1. The graph from method 2.1.2 is
presented in Fig. 2.

Fig. 2.a A graph obtained from method 2.1.2; the solution is ATGCGTGGCA.

Fig. 2.b A graph obtained from method 2.1.2; the solution is ATGGCGTGCA.

In this graph, we have two Eulerian paths:

Path 1: (1-2-3-4-5-6-7-8) = (AT.TG.GC.CG.GT.TG.GG.GC.CA), which results in ATGCGTGGCA,

and

Path 2: (1-2-3-4-5-6-7-8) = (AT.TG.GG.GC.CG.GT.TG.GC.CA), which results in ATGGCGTGCA.

The length of the final sequence can be obtained as

$$n = |S| + l - 1 = 8 + 3 - 1 = 10$$

This method reduces the complexity of the algorithm in solving the DNA sequencing
problem because finding an Eulerian path can be done in polynomial time. 
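
A minimal sketch of method 2.1.2 is given below. It uses Hierholzer's algorithm to find an Eulerian path, which is a standard technique and an assumption of this sketch rather than necessarily the procedure of [10]; the function name is hypothetical and a non-empty ideal spectrum is assumed.

```python
from collections import defaultdict


def eulerian_sequence(spectrum):
    """Reconstruct one sequence from an ideal l-mer spectrum via an Eulerian path."""
    graph = defaultdict(list)    # (l-1)-mer -> successor (l-1)-mers, one per fragment
    balance = defaultdict(int)   # out-degree minus in-degree of each vertex
    for fragment in spectrum:
        prefix, suffix = fragment[:-1], fragment[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # An Eulerian path starts at the vertex with one surplus outgoing edge;
    # if all vertices are balanced, any vertex with an outgoing edge will do.
    start = next((v for v, b in balance.items() if b == 1), next(iter(graph)))

    # Hierholzer's algorithm with an explicit stack.
    stack, walk = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()

    if len(walk) != len(spectrum) + 1:
        raise ValueError("spectrum does not admit an Eulerian path")
    return walk[0] + "".join(vertex[-1] for vertex in walk[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
result = eulerian_sequence(S)
print(result)
```

Each edge is pushed and popped once, so the running time is linear in the number of fragments. Which of the two Eulerian paths is returned depends on the order in which the edges are stored.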



3 DNA sequencing problem as SCS problem 



The DNA sequencing problem can also be stated as the problem of constructing a string
over $Z = \{A, C, G, T\}$ from a given spectrum (not necessarily an ideal spectrum)
$S = \{s_1, s_2, s_3, \ldots, s_n\}$, so that the resulting string is the shortest string which
contains as many of the fragments in the spectrum as possible. This problem is called the
shortest common superstring (SCS) problem.



3.1 SCS problem as an optimization Problem 



If we consider $S = \{s_1, s_2, s_3, \ldots, s_n\}$ over $Z = \{A, C, G, T\}$, then

Solution: Strings that contain every $s_i$ of $S$.

Cost: Length of a string.

Goal: Length is minimum.

Without loss of generality we assume that $S = \{s_1, s_2, s_3, \ldots, s_n\}$ is factor free,
i.e. there are no strings $s_i, s_j \in S$, $i \neq j$, such that $s_i$ is a substring of $s_j$.
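
The factor-free assumption costs nothing because any superstring that contains $s_j$ also contains each of its substrings. A minimal preprocessing sketch is shown below; the function name and the sample input are hypothetical.

```python
def make_factor_free(strings):
    """Drop duplicates and every string that is a substring of another one."""
    unique = list(dict.fromkeys(strings))   # remove duplicates, keep input order
    return [
        s for s in unique
        if not any(s != t and s in t for t in unique)
    ]


# Hypothetical input: ATG is a factor of CATGC, and CATGC appears twice.
reduced = make_factor_free(["CATGC", "ATG", "GCTA", "CATGC"])
print(reduced)
```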



4 SCS problem to solve TSP 

The Travelling Salesman Problem (TSP) with maximum cost can be explained as
follows. Suppose a travelling salesman wants to travel among $n$ cities in such a way that:

1. He visits all the cities.

2. He visits each of the cities only once.

3. He starts from the city $C_1$ and ends in the city $C_n$.

4. He wants to maximize his total cost.

Considering these conditions, we may relate the DNA sequencing problem and the TSP 
along with graph theory using the following approach: 



4.1 A NEW APPROACH 

This approach includes three steps. 

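Step I: For a given spectrum we define a complete weighted graph $K = (V, E, W)$ where

$V = C$, the set of cities (one vertex per city $C_i$, and each city $C_i$ is assigned a
fragment $s_i$).

$E$ = edges between all vertices (a complete graph).

$$W(C_i, C_j) = \mathrm{overlap}(s_i, s_j) =
\begin{cases} |w_{ij}| & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}
\qquad \text{where } s_i = x\,w_{ij},\; s_j = w_{ij}\,y$$

Here $w_{ij}$ is the longest string that is both a suffix of $s_i$ and a prefix of $s_j$.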

In $K = (V, E, W)$, every Hamiltonian path determines a superstring and vice versa. The
length ($L$) of the SCS (defined by the Hamiltonian path) equals the total length of the strings
($T$) minus the sum of the weights of the edges on the path ($W$), i.e.

$$|SCS| = \sum_{i} |s_i| - \sum |w_{ij}|$$

$$L = T - W$$

$$L_{\min} = T - W_{\max} \qquad (1)$$



Step II: To find $W_{\max}$, we define a weighted matrix $M = (a_{ij})$ where
$a_{ij} = W(C_i, C_j)$.

Step III: Using the matrix $M = (a_{ij})$, we develop an algorithm having the following
steps (1 to 8).

1. Find the entry $a_{ij}$ with maximum value in the matrix $M = (a_{ij})$.

2. Merge the two vertices $C_i$ and $C_j$ into a single vertex $C_iC_j$ and also merge the
corresponding strings $s_i$ and $s_j$ into a single string $s_is_j = x\,w_{ij}\,y$.

3. $W_{\max} = \sum a_{ij}$, the sum of the maximum entries selected so far.

4. Construct the matrix $M = (a_{ij})$ for the graph with vertices
$C_1, C_2, \ldots, C_iC_j, \ldots, C_n$ as in Step II.

5. Find the size $n \times n$ of the matrix $M$.

6. If the size of the matrix $M$ is $n \times n = 1 \times 1$, go to step 7, otherwise return to
step 1.

7. The maximum cost is $W_{\max}$ and the SCS $= s_1 \cdots s_is_j \cdots s_n$ for the path
$C_1 \cdots C_iC_j \cdots C_n$.

8. STOP.
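
Steps 1 to 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `overlap` and `greedy_superstring` are hypothetical, the input is assumed non-empty and factor free, and ties between equal maximum entries are broken by taking the first one found, a choice the steps above leave open. The fragments are those of Example 3.

```python
def overlap(a, b):
    """Length of the longest proper suffix of `a` that is a proper prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the pair of strings with the largest overlap."""
    strings = list(fragments)
    w_max = 0
    while len(strings) > 1:                      # steps 5-6: stop at a 1 x 1 matrix
        # Step 1: find the largest off-diagonal entry a_ij.
        best, bi, bj = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j and overlap(a, b) > best:
                    best, bi, bj = overlap(a, b), i, j
        # Step 2: merge s_i and s_j into the single string x w_ij y.
        merged = strings[bi] + strings[bj][best:]
        # Step 3: accumulate the selected weight.
        w_max += best
        # Step 4: rebuild the vertex set with the merged vertex.
        strings = [s for k, s in enumerate(strings) if k not in (bi, bj)]
        strings.append(merged)
    return strings[0], w_max                     # step 7


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
scs, w_max = greedy_superstring(S)
print(scs, w_max)
```

Recomputing every overlap in each round keeps the sketch close to the matrix formulation; it is not tuned for large spectra.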

Step 7 gives the optimal value $W_{\max}$ and the optimal path (Hamiltonian path)
$C_1 \cdots C_iC_j \cdots C_n$ for the travelling salesman problem with maximum cost. The path
$C_1 \cdots C_iC_j \cdots C_n$ results in the shortest common superstring
SCS $= s_1 \cdots s_is_j \cdots s_n$. Since $T$, the total length of the strings, is fixed by the
problem, it is constant for all Hamiltonian paths, and hence from (1) we obtain the length of
the SCS.
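
The relation $L = T - W$ can be verified on any single path. The sketch below merges the strings along a hypothetical three-fragment path and compares the length of the result with $T - W$; the helper names are assumptions made for illustration, and adjacent fragments are assumed not to contain one another.

```python
def overlap(a, b):
    """Length of the longest proper suffix of `a` that is a proper prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def superstring_along(path):
    """Merge the strings of a path in order; return (superstring, total weight W)."""
    text, total_weight = path[0], 0
    for previous, current in zip(path, path[1:]):
        k = overlap(previous, current)
        text += current[k:]          # append only the part not already covered
        total_weight += k
    return text, total_weight


path = ["TTCA", "CATGC", "GCTA"]     # hypothetical path, chosen for illustration
text, W = superstring_along(path)
T = sum(len(s) for s in path)
print(text, T, W, len(text) == T - W)
```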

Example 3: Suppose a hybridization experiment generates the spectrum
S = {CATGC, CTAAGT, GCTA, TTCA, ATGCATC} with fragments of varying length. We need to find
the SCS containing these fragments. Assume that the vertex set is
$V = \{C_1, C_2, C_3, C_4, C_5\}$, where $C_i$ represents a city and is labelled by the
respective element of the spectrum S = {CATGC, CTAAGT, GCTA, TTCA, ATGCATC}. Using the
algorithm, we have the complete weighted graph $K_5$:

Fig. 3. The complete weighted graph $K_5$.

The weighted matrix for $K_5$ is $M = (a_{ij})$ with $a_{ij} = W(C_i, C_j)$. Its maximum
entry is $a_{15} = 4$, since CATGC and ATGCATC overlap in ATGC.

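
The entries of this matrix follow from the definition of $W$ in Step I. A minimal sketch that recomputes them for the spectrum of Example 3 is given below; it assumes a zero diagonal and the longest proper suffix-prefix overlap elsewhere, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest proper suffix of `a` that is a proper prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def weight_matrix(fragments):
    """Return M = (a_ij) with a_ij = overlap(s_i, s_j) and zeros on the diagonal."""
    return [
        [0 if i == j else overlap(a, b) for j, b in enumerate(fragments)]
        for i, a in enumerate(fragments)
    ]


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
M = weight_matrix(S)
for fragment, row in zip(S, M):
    print(f"{fragment:8}", row)
```

Row $i$ holds the weights of the arcs leaving $C_i$, so the matrix is not symmetric in general.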
Next, the vertex set becomes $V = \{C_1C_5, C_2, C_3, C_4\}$, which is labelled by the spectrum
S = {CATGCATC, CTAAGT, GCTA, TTCA}.

For that, we have the complete weighted graph $K_4$:

Fig. 4. The complete weighted graph $K_4$.

In the weighted matrix for $K_4$, the maximum entry is $W(C_3, C_2) = 3$, since GCTA and
CTAAGT overlap in CTA.

Next, the vertex set becomes $V = \{C_1C_5, C_4, C_3C_2\}$, which is labelled by the spectrum
S = {CATGCATC, TTCA, GCTAAGT}. For that, we have the complete weighted graph $K_3$:

Fig. 5. The complete weighted graph $K_3$.

In the weighted matrix for $K_3$, the maximum entry is $W(C_4, C_1C_5) = 2$, since TTCA and
CATGCATC overlap in CA.

Next, the vertex set becomes $V = \{C_4C_1C_5, C_3C_2\}$, which is labelled by the spectrum
S = {TTCATGCATC, GCTAAGT}. For that, we have the complete weighted graph $K_2$.

In the weighted matrix for $K_2$, the maximum entry is $W(C_3C_2, C_4C_1C_5) = 1$, since
GCTAAGT and TTCATGCATC overlap in T.

Thus the optimal path (Hamiltonian path) for the travelling salesman problem is
$C_3C_2C_4C_1C_5$, with the optimal value (maximum cost) $W_{\max} = 4 + 3 + 2 + 1 = 10$. The
path $C_3C_2C_4C_1C_5$ results in the shortest common superstring SCS = GCTAAGTTCATGCATC.
Here $T = 26$ and hence from (1), we obtain the length of the SCS:

$$L_{\min} = T - W_{\max} = 26 - 10 = 16$$
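
For a spectrum this small, the values above can be cross-checked by exhaustive search over all $5! = 120$ orderings of the fragments. The sketch below is an illustrative check for this example only, not part of the proposed algorithm; the names are hypothetical and the indices are 0-based, so the tuple (2, 1, 3, 0, 4) stands for $C_3C_2C_4C_1C_5$.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest proper suffix of `a` that is a proper prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
T = sum(len(s) for s in S)


def path_weight(order):
    """Total overlap along a Hamiltonian path given as a tuple of indices."""
    return sum(overlap(S[a], S[b]) for a, b in zip(order, order[1:]))


# Try every ordering of the five vertices and keep the heaviest path.
best_order = max(permutations(range(len(S))), key=path_weight)
W_max = path_weight(best_order)
L_min = T - W_max
print(best_order, W_max, L_min)
```

The search cost grows factorially with the number of fragments, so this check is feasible only for very small spectra.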

Thus, solving the SCS problem amounts to finding a Hamiltonian path of maximum
weight. Because the total length of the strings is fixed by the problem, and hence constant for
all Hamiltonian paths, minimizing the length of the superstring is equivalent to maximizing the
weight of the path, and so the problem has been converted to a Travelling Salesman Problem (TSP).

5 CONCLUSIONS 



Various researchers have developed other methods for DNA sequencing by hybridization that
handle spectra with positive or negative errors and fragments of constant or variable length.
The method proposed in this paper searches for a path of maximum weight in a weighted graph.
The path translates directly into a DNA sequence. Equivalently, it solves the
shortest common superstring (SCS) problem by finding a Hamiltonian path, which may be
converted to a Travelling Salesman Problem (TSP) with maximum cost. The proposed method has
a disadvantage: it says nothing about the uniqueness of the solution, i.e. whether the
obtained result coincides with the original sequence or not. This problem of the uniqueness of
the solution was pointed out earlier in a number of papers such as [7, 8, 9].